Mh/full pipeline diffusion adjusted scores #2491
@moritzhauschulz: can we also open this against develop, please?
MatKbauer
left a comment
Thanks, Moritz! I'm currently testing this. When I try to launch a training on this branch with
uv run --offline train --base-config config/config_diffusion_d2048_forecast.yml
I'm getting an error that seems related to empty predictions:
Traceback (most recent call last):
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/run_train.py", line 193, in run_train
    trainer.run(cf, devices)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/train/trainer.py", line 429, in run
    self.train(mini_epoch)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/src/weathergen/train/trainer.py", line 549, in train
    self.grad_scaler.step(self.optimizer)
  File "/e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/amp/grad_scaler.py", line 454, in step
    len(optimizer_state["found_inf_per_device"]) > 0
AssertionError: No inf checks were recorded for this optimizer.
[3] > /e/project1/e-ext-2025e01-128/karlbauer1/repos/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/amp/grad_scaler.py(454)step()
-> len(optimizer_state["found_inf_per_device"]) > 0
This does not happen to me on jk/develop/diffusion-full-pipeline, but in the changed files of this PR, I can't quite see where the problem comes from. @moritzhauschulz, do you have an idea?
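For context, one common cause of this assertion (not confirmed as the cause here) is that `GradScaler.step` is called when none of the optimizer's parameters holds a gradient, for example because the loss was built from empty predictions and `backward()` never reached them. A minimal sketch of a check that could be placed before the `step` call to narrow this down; `params_with_grad` is a hypothetical helper, not part of WeatherGenerator or PyTorch, and the stand-in objects below only mimic the `param_groups` layout of a `torch.optim.Optimizer` so the snippet runs without torch:

```python
from types import SimpleNamespace


def params_with_grad(optimizer):
    """Return the parameters of `optimizer` that currently hold a gradient.

    Hypothetical debugging helper: it only relies on `optimizer.param_groups`
    being a list of dicts with a "params" entry, as in torch.optim.Optimizer.
    """
    return [
        p
        for group in optimizer.param_groups
        for p in group["params"]
        if getattr(p, "grad", None) is not None
    ]


# Stand-ins for real parameters and optimizers (assumption: illustration only).
no_grad_param = SimpleNamespace(grad=None)
grad_param = SimpleNamespace(grad=[0.1, -0.2])

broken_optimizer = SimpleNamespace(param_groups=[{"params": [no_grad_param]}])
healthy_optimizer = SimpleNamespace(
    param_groups=[{"params": [no_grad_param, grad_param]}]
)

for name, opt in [("broken", broken_optimizer), ("healthy", healthy_optimizer)]:
    count = len(params_with_grad(opt))
    print(f"{name}: {count} parameter(s) with a gradient")
```

In a real training loop, an empty result right before `self.grad_scaler.step(self.optimizer)` would suggest looking at why the loss has no path back to the model parameters for that batch.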
Thanks for taking a look, @MatKbauer. This is indeed strange, and I am getting the same error. I have seen it before, but I don't recall when... At first glance it seems to be unrelated to my changes (which is why I didn't test the training, whoops). However, I am getting the same currently on
ad88e05 to a26fded
246f05e into ecmwf:jk/develop/diffusion-full-pipeline
Description
This PR basically does two things:
This should probably be reviewed by someone from the eval team (maybe @iluise?).
With this fix, and the fix in the eval config, I can now produce CRPS, SSR and spread plots and maps (pretty much out of the box).
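For readers unfamiliar with these scores, here is a minimal sketch of how ensemble CRPS and the spread-skill ratio (SSR) are commonly defined. This is not the evaluation code touched by this PR; the function names and the choice of sample variance are assumptions made for illustration, and only the Python standard library is used:

```python
import math
from statistics import fmean, variance


def ensemble_crps(members, observation):
    """Empirical CRPS of one ensemble against one observation.

    Uses the common estimator E|X - y| - 0.5 * E|X - X'|, with both
    expectations taken over the ensemble members.
    """
    n = len(members)
    error_term = sum(abs(x - observation) for x in members) / n
    pair_term = sum(abs(a - b) for a in members for b in members) / (n * n)
    return error_term - 0.5 * pair_term


def spread_skill_ratio(ensembles, observations):
    """Spread divided by the RMSE of the ensemble mean, over several cases.

    Spread is the square root of the mean ensemble variance (sample variance,
    an assumption). A value near 1 indicates a well-calibrated ensemble.
    """
    spread = math.sqrt(fmean(variance(members) for members in ensembles))
    squared_errors = [
        (fmean(members) - obs) ** 2
        for members, obs in zip(ensembles, observations)
    ]
    rmse = math.sqrt(fmean(squared_errors))
    return spread / rmse


# Toy values, not output from any model run.
crps = ensemble_crps([1.0, 2.0, 3.0], 2.0)
ssr = spread_skill_ratio([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]], [3.0, 4.0])
print(f"CRPS = {crps:.4f}, SSR = {ssr:.4f}")
```

Lower CRPS is better, while an SSR well above or below 1 points to an over- or under-dispersive ensemble, respectively.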
Issue Number
Is this PR a draft? Mark it as draft.
Checklist before asking for review
./scripts/actions.sh lint
./scripts/actions.sh unit-test
./scripts/actions.sh integration-test
launch-slurm.py --time 60